1 Credits

This tutorial uses materials that was created by the CANSSI Collaborative Research Team led by Vianey Leos Barajas and Marie Auger-Méthé. I thank Fanny Dupont, Ron Togunov, Natasha Klappstein, Arturo Esquivel, Marco Gallegos Herrada, and Sofia Ruiz Suarez for their contribution to this original material.

2 Tutorial goals

The goal of this tutorial is to explore ways to pre-prepare marine movement data. The primary learning objectives are to:

  1. Regularize movement tracks that are irregular using Fastloc GPS data as an example.
  2. Create predicted tracks from error-prone data using Argos data as an example.

3 General setup

First, let’s load the packages that we will need to complete the analyses. Of course you need to have them installed first.

library(momentuHMM) # Package for fitting HMMs, we use it for crawlWrap
library(tidyverse)  # data management
library(terra)
library(tidyterra)
library(ggspatial)  # plot the data
library(sf)         # spatial data processing
library(kableExtra) # produce visually appealing tables
library(geosphere)  # to calculate step lengths from lat/lon locations - just need it installed
library(conicfit)   # needs to be installed for crawlWrap
library(here)       # To help with sourcing

We are assuming that you are working from Madeira-Workshop repository main folder, and using here() to help find the good directory for each files. We will need the functions in the following file utility_functions.R.

source(here("D1-data-prep-ssm",
            "utility_functions.R"))

4 Regularising Fastloc GPS data

4.1 Narwhal movement data

We will analyze a dataset containing movement tracks of three narwhals tagged with Fastloc-GPS tags. The dataset was provided by Dr. Marianne Marcoux (Fisheries and Oceans, Canada). Dr. Marcoux provided the data only for this tutorial, please do not use for other purposes without their consent. Contact: .

For simplicity, we only examine the fastloc-GPS data from one week in August 2017.

Photo by Paul Nicklen
Photo by Paul Nicklen

First, let’s import the raw Fastloc GPS narwhal data and convert the time column to an appropriate date format.

tracks_gps_O <- read.csv("data/tracks_fastloc_gps.csv") %>%
  mutate(time = ymd_hms(time))

We can have a first look at the data, to get more familiar with it.

head(tracks_gps_O)
##        ID                time        x       y loc_class
## 1 T172062 2017-08-08 00:00:00 -80.6414 72.2694       GPS
## 2 T172062 2017-08-08 00:20:00 -80.6525 72.2728       GPS
## 3 T172062 2017-08-08 00:42:00 -80.6665 72.2719       GPS
## 4 T172062 2017-08-08 00:52:00 -80.6692 72.2674       GPS
## 5 T172062 2017-08-08 01:09:00 -80.6682 72.2582       GPS
## 6 T172062 2017-08-08 01:19:00 -80.6613 72.2574       GPS

The columns/variables are: - ID: Individual identifier - time: time of location - x: longitude - y: latiture - loc_class: all GPS

The data we obtain is often quite messy with some rows/records missing information and other records duplicated. We can filter records to keep only complete location data using !is.na(x) & !is.na(y). To remove duplicate records (same time, location, and location class), we will use the lag function from dplyr, which will use the value from the previous row so that we can compare to the current row.

tracks_gps_O <- tracks_gps_O %>% 
  # remove missing locations
  filter(!is.na(x) & !is.na(y),
         # remove identical records
         !(time == lag(time) & 
             x == lag(x) & 
             y == lag(y) & 
             loc_class == lag(loc_class)))

Let’s plot the data over the bathymetry and land layers, to get a sense of the .

4.1.1 Preparing data with covariates

To get a reference of where these narwhals are swimming, let’s import a land layer.

land <- st_read("data/NorthBaffin.shp")

To understand how to include covariates in our model, we will look at two covariates.

First we will look at bathymetry (i.e., depth of the ocean).

bathy <- rast("data/bathy_4_narwhals.tiff")

The raw bathymetry data used to create this raster was taken from the the GEBCO global ocean and land terrain model from https://pressbooks.bccampus.ca/ewemodel/chapter/getting-bathymetry/. Note that land values of the bathymetry layer are 0s. Let’s plot the narwhal movement data over the bathymetry and land layers.

ggplot() +
  geom_spatraster(data=bathy) +
  geom_sf(data=land, fill="beige") +
  geom_spatial_path(data = tracks_gps_O, 
                    aes(x = x, y = y, colour=factor(ID)), crs = 4326) +
  coord_sf(expand = FALSE)

4.2 Import and vizualize data

  • Selecting a time interval
  • Handling data gaps

4.3 Handling irregular data

4.3.1 Selecting a time interval for the HMM

An HMM assumes the observations are collected in discrete time and that there is no missing data in the predictor variables. When the data is irregular, there are two key decisions we must make, (1) the temporal resolution to use, and (2) how to address data gaps. The desired resolution depends predominantly on the biological question you are asking, as different behaviours and biological processes occur at different spatial and temporal scales (e.g., seasonal migration, daily movement between foraging and resting grounds, and fine scale foraging decisions). Generally, higher resolution data is preferred as it has more information. However, it is possible to have too-high of a resolution wherein information from fine-scale variability drowns out the signal from coarse-scale patterns of interest (e.g., seasonal migration). A good rule of thumb, is that you want 3-50 data points per behaviour. For behaviours spanning several hours, that roughly corresponds to a desired resolution between 2 min and 60 min.

First, let’s calculate the time difference between successive records using difftime and lead (compares current row to following row) and place these values in a new column called dt. Note that the time difference is in minutes (units = "mins"). For the last record of each individual (i.e., when ID != lead(ID)), we will set the value to NA.

# Calculate time difference between locations
tracks_gps_O <- tracks_gps_O %>%
  mutate(dt = ifelse(ID == lead(ID), # If next data row is same individual
                     # calculate time difference
                     difftime(lead(time), time, units = "mins"), NA))

Let’s see what resolutions may be possible in the data by looking at the most frequent time gaps.

# Visualise time differences (all and zoomed)
par(mfrow = c(1, 2))
hist(tracks_gps_O$dt, 1000, main = NA, xlab = "Time difference (min)")
hist(tracks_gps_O$dt, 1000, main = NA, xlab = "Time difference (min)",
     xlim = c(0,100))

# identify the most frequent dt
tracks_gps_O %>% 
  {table(.$dt)} %>% 
  sort(decreasing = TRUE) %>% 
  head()
## 
##  10  11  12  22   9  13 
## 400 145  90  64  63  63

We see that the most frequent time gap is \(10\) min, followed by \(11\), \(12\), \(22\), \(9\) and \(13\) min. We also see the majority of the gaps are \(< 60\) min, however some are in excess of \(600\) min. Because HMMs assume observations taken at regular time intervals, finer resolutions will contain more data gaps that would need to be interpolated. Frequent and large data gaps can be difficult to handle, especially as the number of missing data points approaches or exceeds the existing data. Let’s examine the potential data structure at different resolutions for the different animals.

We can now use the function p_na (in the script utility_functions.R) to look at the proportion of NAs we would get with 10, 20, 30, and 60 min resolution.

# summarise track dt
tracks_gps_O %>% 
  group_by(ID) %>% 
  summarise(p_NA_10m = p_na(time, 10),     # 10 min 
            p_NA_20m = p_na(time, 20),     # 20 min 
            p_NA_30m = p_na(time, 30),     # 30 min 
            p_NA_60m = p_na(time, 60)) %>% # 60 min
  # return formatted table
  kable(digits = 3, col.names = c("Narwhal ID", paste(c(10,20,30,60), "m"))) %>%
  kable_styling() %>% 
  add_header_above(c("", "Resolution" = 4))
Resolution
Narwhal ID 10 m 20 m 30 m 60 m
T172062 0.486 0.236 0.191 0.152
T172064 0.502 0.203 0.120 0.074
T172066 0.488 0.230 0.185 0.160

Here we see that the \(10\) min interval, around \(50\%\) of the locations would be missing.

This is a lot, but choosing finer resolutions would likely bias the results.

For this tutorial, we will use a \(10\) min resolution, though in many cases we may want to be more conservative and use a 30 min resolution.

4.3.2 Handling data gaps

There are several ways to deal with data gaps:

  1. Split tracks
  2. Interpolate locations
  3. Fill the gaps with NAs
  4. Multiple imputation

To create the data used in this tutorial, we combined options 1 and 2. The basic steps are:

  1. Splitting tracks
  2. Interpolating locations within each track segment

4.3.3 Splitting tracks

One way to account for missing data is to split the track where there are large gaps (i.e., assign each track segment a new individual ID). This strategy is particularly appropriate when you have long enough gaps for which interpolation method are unlikely to perform well. We can split the tracks when the gaps larger than a predetermined threshold by iterating the ID column. Here, we will use a function (found in utility_functions.R) to split the tracks. We define the maximum allowable gap (at which point it will start a new segment), as well as the shortest allowable track segment.

These are somewhat arbitrary decisions, and depend on your subsequent choices for regularisation. In this tutorial, we will be interpolating missing locations (within each segment) and so we only want to allow gaps that can reasonably be predicted. We’re using a 10-minute resolution, so we allow a 60 minute gap (i.e., we assume we can predict 6 missing locations), and we want each segment to be at least 20 locations long so that we have enough information about state transitions.

# Use function from utility_function.R to split data at gaps > 2 hours
data_split <- split_at_gap(data = tracks_gps_O, 
                           max_gap = 60, 
                           shortest_track = 20)

An alternative to this approach is to fill large gaps with NAs, but this can be problematic if there is any covariate-dependence on the transition probabilities. You can find some code further below, that handles missing data by setting the gaps to NAs, using the package adehabitatLT.

4.3.4 Interpolation (correlated random walk)

Once the track is split, there is often still irregularity within each segments, and we want to interpolate or predict new locations to form a regular grid of observations.

The simplest approach is to use linear interpolation between missing times, but a better option is to predict locations from a continuous-time correlated random walk (CTCRW). momentuHMM contains wrapper functions to interpolate missing locations by fitting a CTCRW to the data based on the crawl package by Devin Johnson and Josh London. There are many options to fit the CTCRW model, and a detailed tutorial for analysis with crawl is available here: https://jmlondon.github.io/crawl-workshop/crawl-practical.html. Let’s try to fit the most basic model using the wrapper function crawlWrap. In the most basic form, we only need to provide tracking data with the columns ID, time, x, and y and specify the desired temporal resolution.

First, let us transform the data into an sf object. crawlWrap can also take a data.frame as an argument but that would imply renaming some of our columns. It is easier to just transform the data into an sf object.

data_sf <- data_split %>%
  st_as_sf(coords = c("x", "y")) %>% # converts to an sf object
  st_set_crs(4326) %>% # define CRS
  st_transform(2962) # reproject data to a UTM

Now we can fit the CTCRW to each track segment and create a dataframe of predicted locations.

# crawl can fail to fit periodically, so I recommend always setting a seed 
set.seed(12)

# fit crawl
crwOut <- crawlWrap(obsData = data_sf, timeStep = "10 mins", theta = c(7, 0))
## Fitting 22 track(s) using crawl::crwMLE...
## Individual T172062-1...
## Individual T172062-2...
## Individual T172062-3...
## Individual T172062-4...
## Individual T172062-5...
## Individual T172062-6...
## Individual T172062-7...
## Individual T172064-1...
## Individual T172064-2...
## Individual T172064-3...
## Individual T172064-4...
## Individual T172064-5...
## Individual T172064-6...
## Individual T172064-7...
## Individual T172064-8...
## Individual T172066-1...
## Individual T172066-2...
## Individual T172066-3...
## Individual T172066-4...
## Individual T172066-5...
## Individual T172066-6...
## Individual T172066-7...
## DONE
## 
## Predicting locations (and uncertainty) at 10 mins time steps for 22 track(s) using crawl::crwPredict... DONE
plot(crwOut, animals = "T172062-3", ask = FALSE)


# Get predicted tracks from crawl output
data <- crwOut$crwPredict[which(crwOut$crwPredict$locType == "p"),
                              c( "ID", "mu.x", "mu.y", "time")]
colnames(data) <- c( "ID","x", "y", "time")

Notice how the predicted tracks do not make perfectly straight lines through missing sections, which is an improvement on simple linear interpolation methods.

4.3.5 Pad gaps with NAs

We mentioned previously that another strategy to address data gaps is to leave the data streams (i.e., step length and turning angle) as NAs. The maximum size of a gap to allow depends on the frequency of the missing data, frequency of locations, study species, and behaviours of interest. Voiding gaps can be used on its own or in conjunction with splitting tracks. The package adehabitatLR has a function setNA dedicated to it.

Let us first install and load the package.

We will apply this to the split tracks that we used previously used in the tutorial. Here, instead of using crawl to interpolate missing locations, we are simply creating a dataframe with the missing locations set to NA (i.e., creating a regular grid of observations with some NAs).

# Create adehabitat object, containing the trajectory padded with NAs
data_ade <- setNA(ltraj = as.ltraj(xy = data_split[, c("x", "y")], 
                                   date = data_split$time, 
                                   id = data_split$ID), 
                  date.ref = data_split$time[1], 
                  dt = 10, tol = 5, units = "min")

# Transform back to dataframe
data_na <- ld(data_ade)[, c("id", "x", "y", "date")]
colnames(data_na) <- c("ID", "x", "y", "time")

4.3.6 Multiple Imputation

Multiple imputation works by first fitting a CTCRW model to the original data, second, drawing (i.e., simulating) a number of realisations of the position process based on the CTCRW model, third, fitting HMMs to each of the simulated realisations, and finally, pooling the estimated parameters. momentuHMM has several functions to implement multiple imputation. The function MIfitHMM can be used both to simulate realisations of a fitted CTCRW and fit HMMs to each one. The number of simulations is specified with nSims. We can simulate realisations without fitting HMMs by setting fit = FALSE.

Here, let’s use first fit a CTCRW model on complete tracks (not segmented) that we fit in the tutorial to simulate 4 tracks using MIfitHMM and plot them over the original track.

set.seed(12)

# transform the tracks into an sf object
tracks_gps_sf_O <- tracks_gps_O %>%
  st_as_sf(coords = c("x", "y")) %>% # converts to an sf object
  st_set_crs(4326) %>% # define CRS
  st_transform(2962) # reproject data to a UTM

#Fit the correlated random walk, MIfithmm takes a crwData object
crw_gps_10 <- crawlWrap(obsData = tracks_gps_sf_O, timeStep = "10 mins")
## Fitting 3 track(s) using crawl::crwMLE...
## crawl 2.3.0 (2022-10-06) 
##  Demos and documentation can be found at our new GitHub repository:
##  https://dsjohnson.github.io/crawl_examples/
##  
##  WARNING!!! v. 2.3.0 will be the last version of {crawl} hosted on CRAN.
##  see 'https://github.com/NMML/crawl' for any future bug fixes.
## 
## Attaching package: 'crawl'
## The following object is masked from 'package:purrr':
## 
##     flatten
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: rngtools
## Individual T172062...
## Beginning SANN initialization ...
## Beginning likelihood optimization ...
## Individual T172064...
## Beginning SANN initialization ...
## Beginning likelihood optimization ...
## Individual T172066...
## Beginning SANN initialization ...
## Beginning likelihood optimization ...
## DONE
## 
## Predicting locations (and uncertainty) at 10 mins time steps for 3 track(s) using crawl::crwPredict... DONE
# simulate 4 realisations of the 10 min GPS CTCRW model
MI_sim_gps <- MIfitHMM(crw_gps_10, nSims = 4, fit = FALSE)
## Drawing 4 realizations from the position process using crawl...
## Running simulator for individual T172062...
## Running simulator for individual T172064...
## Running simulator for individual T172066...
## DONE
## Drawing imputation 1...
## Drawing imputation 2...
## Drawing imputation 3...
## Drawing imputation 4...
## DONE
# plot locations for first narwhal
# filter first ID from original data
track <- tracks_gps_sf_O %>% 
  mutate(x = st_coordinates(tracks_gps_sf_O)[,"X"], 
         y = st_coordinates(tracks_gps_sf_O)[,"Y"]) %>% 
  filter(ID == "T172062")
# filter first ID for each simulation
sim_tracks <- lapply(MI_sim_gps$miData, function(x){
  filter(x, ID == "T172062")})

# plot original track for first narwhal
plot(track$x, track$y, col = "red", xlab = "x", ylab = "y", asp = 1, type = "l")

# plot each simulated track
mute <- mapply(function(data, col) {
                 points(y ~ x, data = data, col = col, type = "l")
               }, data = sim_tracks, 
               col = list("cadetblue1", "cadetblue2", "cadetblue3", "cadetblue4"), 
               SIMPLIFY = FALSE)

Notice how in some areas the different simulations have generally good agreement in the likely location during gaps, while in others they diverge. Multiple imputation can be particularly powerful if we want to incorporate environmental variables, as spatially explicit variables can be extracted for each simulated track to sample the most likely conditions encountered by the animal.